Applied Data Science Capstone

Neighborhood Clustering in Toronto

In completion of requirements for the IBM Data Science Professional Certificate on Coursera

Daniel Nezich


Overview - Neighborhood Clustering in Toronto, CA

This project uses location and census data to group the neighborhoods of Toronto, Canada. Such a grouping could help someone move across town to a similar neighborhood, or identify which area of the city they would most enjoy visiting.

First, the locations to be grouped are defined as the Forward Sortation Area (FSA) regions of Canada Post. A list of FSAs and their corresponding neighborhoods in Toronto is read from Wikipedia. The geolocation service HERE is used to find a latitude and longitude for each FSA.

Second, features are generated for each location. These come in two forms: numerical data from the Canada 2016 Census, and a list of nearby venues from Foursquare. The numerical features are modified to approximate a normal distribution, and the list of venues is converted into categorical vectors. Various levels of categorization are prepared for exploration.

Third, a k-means model is used to fit the features. A range of feature sets is used to see how clustering performance is affected by feature choice, and for each feature set a range of cluster number is examined to determine the optimal cluster number via the gap statistic.

Fourth, an optimal model is selected and examined in detail, including determination of important predictive features by ANOVA p-value, cluster identification by average feature value, and cluster display on a map.

Imports

Table of Toronto Postal Codes (FSAs)

First we will use the Requests module to get the webpage containing the data we need.

We could inspect webpage.text and extract the tables with BeautifulSoup (see appendix), but parsing tables from the webpage text can also be handled by Pandas.
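As a minimal sketch of the Pandas approach (parsing an inline HTML snippet instead of the live page; the table contents are hypothetical), `pandas.read_html` returns a list of DataFrames, one per table found:

```python
from io import StringIO

import pandas as pd

# A toy HTML table standing in for the Wikipedia page (hypothetical rows)
html = StringIO("""
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
""")

tables = pd.read_html(html)  # one DataFrame per <table> element
df = tables[0]
print(df.shape)  # (2, 3)
```

Note that `pd.read_html` requires an HTML parser such as lxml or html5lib to be installed.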

Let's inspect the tables automatically parsed from the page:

By inspection, the table we want is at index 0. (Wikipedia has since reformatted the page in a way that breaks this parsing, but the Internet Archive preserves the prior version, so the analysis is retained and the link above points to the archived page.)

Let's clean up the columns.

First, the Postal Code is expected to be unique:

So rows are uniquely indexed by the Postal Code, as desired, and we do not have to combine neighborhoods into a comma separated list as per the assignment instructions - they already are.

Second, we keep only the rows whose Borough is not 'Not assigned':

So 77 Postal Codes were not assigned to a Borough.

We should also ensure that all Borough names are legitimate. This must be done manually.

Those all look like Borough names, and there are 10 total.

Third, we want each entry in Neighborhood that is 'Not assigned' to be the Borough name.
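A minimal sketch of this fallback (column names as in the parsed table; the rows are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'Postal Code': ['M7A', 'M9A'],
    'Borough': ["Queen's Park", 'Etobicoke'],
    'Neighbourhood': ['Not assigned', 'Islington Avenue'],
})

# Wherever the neighbourhood is 'Not assigned', fall back to the borough name
mask = df['Neighbourhood'] == 'Not assigned'
df.loc[mask, 'Neighbourhood'] = df.loc[mask, 'Borough']
print(df['Neighbourhood'].tolist())  # ["Queen's Park", 'Islington Avenue']
```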

So all Neighbourhood entries look non-empty.

But on closer inspection there are entries that don't make sense for a neighborhood analysis, so we should drop them:

These will actually be removed in a later step, where we eliminate postal codes that have no residential space associated with them (like government buildings, office buildings, and commercial plazas). We will keep the full dataframe for now. The drop on inspection was previously (and incompletely) performed with the line:

df_cleaned = df.drop(index=[76, 92, 100]).reset_index(drop=True)

We conclude this section by displaying the final dataframe:

And finally, the postal codes dataframe shape is:

Latitude and Longitude for each Toronto Postal Code

After looking through many geopy and geocoder options (see Appendix), I chose to use Nominatim. While Nominatim doesn't succeed at partial postal code lookup, it did succeed with many neighborhoods. Several missed assignments were due to slightly different names, but there were still over 80 imperfectly assigned neighborhood locations.

I then tried using a commercial service. Google appears to charge for any developer use, but HERE has a freemium service which I did sign up for. With an API key I was able to get geopy to geocode a partial postal code, so I'll be continuing work there. Note that the geocoder module does not seem to work, possibly due to a misconfigured web address for requests (based on the errors returned).

To keep api keys secret, they are stored in a config.py file listed in the .gitignore file on GitHub.

To reload the configuration file should it need to be changed in development, run:

import importlib
importlib.reload(config)

For HERE, we need to supply just the REST API key (called the app code on their site).

Now we can try to geocode the postal codes:

Offline inspection for missing values:

Wow, that was so much easier than trying to massage the Nominatim data (see Appendix).

Note that these three indices are the ones we previously considered for removal by manual inspection of the neighborhood list. We can remove them here.

Let's load up a csv file of locations and see if it matches.

What do these differences mean in terms of distance?
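To translate latitude/longitude differences into metres, a haversine helper can be used (a sketch; the coordinates below are approximate downtown Toronto points, not actual FSA centroids):

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    R = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

# Two illustrative downtown points
d = haversine_m(43.6532, -79.3832, 43.6629, -79.3957)
print(round(d))  # roughly 1.5 km
```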

That's a very big difference! Though most are smaller, we'd still like some confidence in knowing which to use.

Manually inspecting Google Maps, the outline for M6A postal code has the HERE Lat/Long at the center, and the csv value near the left edge of the zone. For M8Y, both HERE and csv are a little off, though HERE is probably closer. Ideally with the postal code polygons we could find the center of mass, but that level of exactness won't help us.

Let's look at the distribution of nearest-neighbor distances, since this might inform our choice of neighborhood radius.
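One way to compute nearest-neighbour separations is scikit-learn's BallTree with the haversine metric, which works in radians on a unit sphere (a sketch with made-up centroids):

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6371000.0

# Hypothetical FSA centroids as (lat, lon) in degrees
coords = np.array([
    [43.6532, -79.3832],
    [43.6629, -79.3957],
    [43.7615, -79.4111],
])

tree = BallTree(np.radians(coords), metric='haversine')
# k=2: the nearest neighbour of each point is itself, so take the second column
dist, ind = tree.query(np.radians(coords), k=2)
nn_dist_m = dist[:, 1] * EARTH_RADIUS_M
print(np.round(nn_dist_m).astype(int))
```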

Large separations make sense for some of the larger neighborhoods, but is there something causing the short distances?

Let's inspect all postal codes with separation less than 'radius' meters:

As could be expected, Downtown Toronto is the densest area of the city. First Canadian Place and the Underground City are an office building and an extensive network of underground tunnels and shops - a hiccup for Euclidean analysis. What happens if we drop that neighborhood?

So there are still relatively small distances between neighborhoods in downtown Toronto. Let's stay with the original list anyway.

For feature generation, we can consider that larger areas (more separated neighborhoods) are probably less dense, and to get a sampling of amenities we may need to query on larger distances.

But before moving on to that, let's add a couple of extra interesting features not from Foursquare, such as population density.

Census Features

We collect some simple features from the 2016 census:

Begin by loading the relevant data into a dataframe:

Let's examine any missing values.

So four more irrelevant postal codes are found. Note that the previous three would appear in this list as well, as they also have no entry for land area.

Let's drop these four new rows, since they are clearly not relevant from a residential standpoint. It could be argued that there may be residences nearby, but as noted previously for Downtown Toronto, the density is high and the remaining postal codes should capture any neighborhood differences.

We should use the new population, dwelling, and area features to compute features more suitable for comparison:

Let's visualize how these new features are distributed:

So that our variables are near-normally distributed, we should keep log$_{10}$ Population Density, unmodified Dwelling Density, and log$_{10}$ of (1 - the Residential-to-Total Dwelling Ratio). The ranges will be adjusted using scalers during classification. Interestingly, the Population Density shows some evidence of clustering (though with only single low bins between the peaks, this may just be an artifact).

Let's construct and keep those variables in the dataframe:
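A sketch of the derived variables, under assumed column names and invented numbers (the real values come from the census dataframe):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Population': [30000, 12000],
    'Dwellings': [11000, 4500],
    'Residential Dwellings': [10500, 4400],
    'Land Area': [4.0, 9.5],  # km^2
})

df['Population Density'] = df['Population'] / df['Land Area']
df['Dwelling Density'] = df['Dwellings'] / df['Land Area']
residentiality = df['Residential Dwellings'] / df['Dwellings']

# Log-transform the skewed variables so they are closer to normal
df['Log10 (Population Density)'] = np.log10(df['Population Density'])
df['Log10 (1 - Residentiality)'] = np.log10(1 - residentiality)
print(df[['Log10 (Population Density)', 'Dwelling Density']].round(2))
```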

Let's also recompute the nearest neighbor distance and index, since the number of rows has been reduced since we last did that.

And the distribution of neighborhood distances is now:

So those close neighborhood distances had been due to zero-area postal codes. Now neighborhoods are separated by at least 500 meters, a sensible length scale (about a 5-minute walk) over which city characteristics change.

Let's visualize our final dataframe for this section, sticking with the values from the HERE API lookup (in df_c), dropping the GString as it has served its purpose for validation, and keeping the nearest neighbor information as it might prove useful in the next section.

Finally, let's view the final dataframe.

Venue Discovery and Feature Generation

We begin by loading Foursquare credentials and defining variables necessary for our queries.

We now define some utility functions for data acquisition and plotting.

We can now get nearby venues for each postal code. The Radius entry is the radius (in meters) required for the Foursquare API to return at least LIMIT/2 (50) venues. If the extend parameter is True, the radius is increased in increments of half the input radius until at least LIMIT/2 venues are returned.

If the lookup fails due to quotas (or, presumably, timeouts), the following cell can be re-run until it completes successfully - for example, on successive days if execution is interrupted by quota limits.

Let's make sure we have some features for every location:

Let's see the distribution of number of venues found in each postal code:

And the distribution of frequency of venue type across all of Toronto:

It seems there is a hierarchy of categories, and we might do well to restrict ourselves to higher-level categories. Let's get the category data structure:

And we can examine how many labels are at each level of the hierarchy:

This can quickly find the sum of branch and leaf categories at each depth in our data:

Alternatively, we can examine explicitly how many labels are leaf or branch nodes for all potential categories:

And then discover whether the Toronto venue data categories occur at branch or leaf nodes:

Venue Feature Creation

We can create one-hots for both leaf and branch categories, and assign more weight in the clustering to lower-depth nodes (to emphasize matching broader categories first), but this may get needlessly complex. We can generate the additional one-hot dataframes just in case, and then continue with the simpler analysis first.

Note that this could be done by limiting different parent categories to different depths of children, but that is too complex for now.

Note also that it would be reasonable to compact some venues (like zoo exhibits) into one entry in order to not overemphasize certain venues (like zoos), but this is also a bit beyond the current complexity level.

It may be useful to know how many categories we have to deal with at each depth:

From all this we expect the optimal category depth to be 1 or 2; depth 0 is quite general, while depth above 2 becomes too specific and adds little additional information (none at all in the case of depth 4).

Note that the columns 'Venue Category' and 'Venue Category, Depth 4' are equivalent.

Create a venue vector for each neighborhood

It may be interesting to examine the venues occurring at highest frequency in each postal code. We do so below, but note that this is not used in feature generation in favor of the grouped onehot vectors. Using features created only from the highest frequency venue categories would be an interesting approach to try (some other time).
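The grouped one-hot venue vectors mentioned above can be sketched as follows (hypothetical venue rows standing in for the Foursquare results):

```python
import pandas as pd

venues = pd.DataFrame({
    'Postal Code': ['M3A', 'M3A', 'M3A', 'M4A'],
    'Venue Category': ['Park', 'Coffee Shop', 'Park', 'Pizza Place'],
})

# One-hot encode the categories, then average per postal code so each
# row becomes a frequency vector over venue categories
onehot = pd.get_dummies(venues['Venue Category'])
onehot['Postal Code'] = venues['Postal Code']
grouped = onehot.groupby('Postal Code').mean()
print(grouped)
```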

Returning to feature generation, it might be useful to include the overall areal density of venues as a feature. This can be done using the radius that was required to get the target number of venues returned when we listed the venues in each postal code. This method is imperfect because of the results massaging done by Foursquare, but is the best we can do with that data source.

The distribution of the logarithm of venue density looks more normal and therefore is expected to be better suited as a feature for clustering, so we will use that below.

Here we construct a list of inputs to our clustering algorithm, with the most relevant entry being various feature sets it will be interesting to fit to, including:

Note that giving different scale factors to various features might result in improved fits - this might be worth exploring in a future work.

K-Means Clustering

We will perform a cluster number optimization on all the feature sets previously specified to determine an optimal cluster number k and optimal feature set.

We will first use the elbow method to get an idea of the optimal k, looking at the mean distance to a centroid vs k to see the elbow in a continuously decreasing line. After that we will look at the gap statistic to get a more quantitative measure of the optimal k.

Elbow Method for Optimal K

We now plot the sum of intra-cluster sum of square distances to centroids. The elbow in these plots can be taken as the optimal k. We notice that:

Unfortunately, the elbow method falls victim to a scaling issue: locating the elbow by maximum curvature, by matching the slope of the data range, or by the average angle between linear approximations of the start and end of the curve all share the caveat that an overall scaling factor moves the location of the match condition.

Gap Statistic for Optimal K

This section uses the gap statistic, which measures the difference between the sum of squared distances to assigned centroids for the data and for a similar featureset with uniformly distributed values (that is, without any cluster information). The optimal k is chosen as the lowest value of k for which the statistic is higher than that of the next value of k less the standard deviation of the statistic at the next value of k. In other words, k is chosen at the point of diminishing informational return from adding the next cluster. See here for further description.
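The calculation can be sketched on synthetic data (three well-separated blobs; `n_refs` uniform reference datasets drawn over the data's bounding box - a simplified version of what the notebook's functions implement):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data: three well-separated 2-D blobs
data = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in [(0, 0), (5, 0), (0, 5)]])

def inertia(X, k):
    """Sum of squared distances to assigned centroids."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_

def gap_statistic(X, ks, n_refs=10):
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, stds = [], []
    for k in ks:
        # Reference datasets: uniform over the bounding box (no cluster structure)
        ref_logs = [np.log(inertia(rng.uniform(lo, hi, size=X.shape), k))
                    for _ in range(n_refs)]
        gaps.append(np.mean(ref_logs) - np.log(inertia(X, k)))
        stds.append(np.std(ref_logs) * np.sqrt(1 + 1 / n_refs))
    return np.array(gaps), np.array(stds)

ks = list(range(1, 7))
gaps, stds = gap_statistic(data, ks)
# Optimal k: smallest k with gap(k) >= gap(k+1) - std(k+1)
k_opt = next((k for i, k in enumerate(ks[:-1]) if gaps[i] >= gaps[i + 1] - stds[i + 1]),
             ks[-1])
print(k_opt)
```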

Note that in this section text and graphics may not correspond due to nonreproducibility of the numpy random number generator; the text corresponds to one instance of execution.

First we define some functions to calculate and display the gap statistic.

Gap Statistic Variation with Number of Uniform Reference Distributions: Venues Only

Let's vary the number of reference distributions to see how the optimal k prediction becomes less noisy.

Notice how the standard deviation of the reference distribution increases with the number of reference distributions, but the variation of the mean decreases.

We see that the optimal k is occasionally 6, but more reliably 8 or 9. There is variation due to the state of the random number generator (though this was fixed for the k-means fitting).

Gap Statistic Variation with Venue Feature Depth: Venues Only

Next let's look at how the venue depth changes the gap statistic. This is a bit unusual because the dimensionality increases greatly as the features change from a dense to a sparse vector. The appropriateness of the gap statistic for sparse features has not been determined, and is assumed suspect.

Notice that with increasing depth the optimal k increases: 6, 10, 13, 14, 14. *Note values may display differently due to altered random states.

Close-up of K Means Inertia with Gap-Statistic Optimal K

As the depth increases, the chosen knee location becomes increasingly skewed.

Gap Statistic with Combined Features

Let's examine other feature combinations we fit to above.

The optimum k here is 7 or 8, slightly higher than the venues-only choice, with lower variability than the venues-only curves.

Gap Statistic for Non-Venue Data

These features were processed to be nearly normal, so it is not surprising that they result in low optimal cluster numbers.

The optimal k is 2 when census data is involved, and 1 when only venue density is considered. This result is expected, since the features were adjusted to be relatively normal in the preprocessing stage.

Optimal K-Means Model

There are many ways to evaluate the optimal k, and determining the best evaluation metric is beyond the scope of this project. A good jumping-off point is the yellowbrick module, which implements several elbow detection methods. Other sources point to the gap statistic and silhouette coefficient, as well as a host of other metrics, which can be used to choose the optimal k and are often implemented in modules or libraries (see here and here). Adapting the statistics or preprocessing of the data for vector features instead of column features will require some thought, as the examples I've found seem to assume normalized column features.

From the preceding section using elbow inspection and the gap statistic, we select dataset 10 (venue vector globally scaled to 1.4 range, census and venue density data scaled to 1.0 range), and k~8 as near-optimal model inputs.

Can the clustering procedure be refined any more?

Feature Pruning

Using the ANOVA F-test we can evaluate which features are important for the clustering decision.

We begin by defining some useful display functions:

For the selected feature set we display the p-values (probability that the feature distributions come from the global distribution under the assumption of normality). P-values below 5% are highlighted yellow, and indicate the feature is significant.
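The test for a single feature can be sketched with `scipy.stats.f_oneway` (synthetic clusters; one feature that separates them and one pure-noise feature):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
labels = np.repeat([0, 1, 2], 50)  # three hypothetical cluster assignments

# One feature whose mean shifts with the cluster, one that is pure noise
informative = rng.normal(labels * 2.0, 0.5)
noise = rng.normal(0.0, 1.0, size=labels.size)

pvals = {}
for name, feat in [('informative', informative), ('noise', noise)]:
    groups = [feat[labels == c] for c in np.unique(labels)]
    _, pvals[name] = f_oneway(*groups)
    print(f'{name}: p = {pvals[name]:.3g}')
```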

Clearly, the less important features are:

Let's see if the same trend shows for the venue vector at various depths:

Note that already at depth 2 we are getting some p-values of 0, indicating that overfitting is occurring for sparse labels (e.g. Construction & Landscaping at k=8). The largest value of k that avoids this overfitting is k = 6.

Final Dataset and Model

From the preceding section using elbow inspection and the gap statistic, we select dataset 10 (venue vector globally scaled to 1.4 range, census and venue density data scaled to 1.0 range), and k=8 as gap statistic-optimal model inputs (this allows feature Arts & Entertainment to be used at a decent significance level, as well as College & University at a marginal significance level).

We re-run the clustering a final time with only the most useful features at depth 0 and k=8:

We can see from the p-values that the significance is as expected.

We now display the means of each feature for each cluster, to gain some insight into what the clusters mean.

We can assign some descriptions to the clusters based on their categories above and knowledge of city structure:

  1. Mid-Urban Well-Rounded
  2. Suburban Shop-Focused
  3. Downtown
  4. Suburban Family & Outdoors
  5. Suburban Food-Focused
  6. Urban
  7. Suburban Well-Rounded
  8. Suburban Arts & Outdoors

Map of Clusters

We use Folium to display the clusters on a map.

Conclusion

We segmented the postal codes of Toronto into clusters using a k-means model acting on features derived from population and dwelling information from the 2016 census and nearby venues obtained from Foursquare. The optimal number of clusters was found to be ~5-10, and the clustering was explored for the specific case of 8 clusters. The resulting map and cluster descriptions are reasonable.

For future reference and improvement, some potential issues with the analysis were noted. The Foursquare venue data may be spotty, and the magnitude of selection bias in the dataset is unknown - not all venues appear in the dataset, and certain locations and types of establishments are likely favored by the individuals who use Foursquare. For example, not all universities and colleges appear in the venue data, and a zoo had a large number of exhibits that gave it an outsized weight. The FSA areas seem relatively large, and an approach using a finer grid of locations and a smaller search radius might yield improvement. Alternative datasets might be more thorough, such as telephone directories (organized by business type), or provide other quality data (such as census population profiles with age and ethnicity categories, like here where reporting at the FSA level is an option). Alternatively, a local venue database could be constructed by grid query of Foursquare, with venue vectors constructed based on proximity. Restricting venue data to zero category depth is also suboptimal - in the future, higher category depths should be explored with concurrent feature elimination (e.g. eliminating features appearing in fewer than an arbitrary number of locations, or below an arbitrary threshold global count).

APPENDIX

Foray into BeautifulSoup for parsing webpages: obsolete because pandas.read_html() works fine.

Next we will parse the webpage using Beautiful Soup.

from bs4 import BeautifulSoup # had to install to environment in Anaconda

soup = BeautifulSoup(webpage.text, 'html.parser')

To view the HTML directly we could run:

print(soup.prettify())

But we can also inspect the webpage for the table we expect:

tables = soup.find_all('table')
len(tables)

for i, table in enumerate(tables):
    print('Length of table {} string: {}, attributes: {}'.format(i, len(table.text), table.attrs))

import lxml
import html5lib

str(tables[0])

df_pcodes = pd.read_html(str(tables[0]))[0]

df_pcodes = pd.DataFrame(df_pcodes[1:][:],df_pcodes[1:][1],df_pcodes[0][:])

type(df_pcodes)
df_pcodes

Installation of dependencies: with relation to developing in Anaconda

conda install -c conda-forge geocoder

Gave problem with openssl-1.1.1h-he774522_0.tar.bz2

From Anaconda Navigator, I removed openssl, restarted, then installed it again.

Doing so seemed to reset the environment; I had to reinstall jupyterlab, pandas, numpy, lxml, html5lib, bs4.

I also discovered the Channels setting, and by adding conda-forge I was able to access packages that I installed through the prompt before: geocoder, jupyterlab-git, ipywidgets.

So there may be additional fallout from this, but at least I can use Anaconda as intended now that I can get the packages I need through the UI.

Geocoding

Unfortunately, geocoder.google('Ottawa, ON') and other geocoder services failed to return data (over 200 explicit calls, and over 20 minutes in a loop, in the case of Google). The exception was CanadaPost, which did work but only returns a postal code. Instead I elect to use Nominatim; we'll see if that works for Toronto...

The following cells are the original geocoding work:

I would like to try to get latitude and longitude for each neighborhood, so let's make a new dataframe:

df_n = pd.DataFrame(columns=df_cleaned.columns) 
for i, row in df_cleaned.iterrows():
    for neighborhood in row['Neighbourhood'].split(', '):
        df_n = df_n.append(row, ignore_index=True)
        df_n.loc[df_n.shape[0]-1,'Neighbourhood'] = neighborhood
df_n.head(10)

When a neighborhood appears in several postal codes, part of the neighborhood appears in each one.  In absence of geographic boundary polygons, let's weight the location of each neighborhood by the inverse of its number of appearances overall when we average the latitude and longitude of neighborhoods to get the postal code locations.  So let's count the number of times the neighborhood appears in the list.

for i, row in df_n.iterrows():
    df_n.loc[i, 'Count'] = int(df_n[df_n['Neighbourhood']==row['Neighbourhood']].shape[0])
df_n['Count'] = pd.to_numeric(df_n['Count'], downcast='integer')

display(df_n.head())
df_n[['Count']].value_counts(sort=False)

Good, not many neighborhoods appear in more than one postal code, so this should not have much effect on the results.

Next we will geocode the neighborhoods.

Attempting to geocode a latitude and longitude in a non-commercial environment requires some effort.

Going through some of the list at the geocoder [documentation](https://geocoder.readthedocs.io/):
* geopy.geocoders.google never returned a location in over 200 calls
* geopy.geocoders.canadapost does not return latitude and longitude
* geopy.geocoders.geolytica throws AttributeError: 'dict' object has no attribute 'strip' (see alternate geocoder.ca approach below)
* geopy.geocoders.bing requires an API key
* geopy.geocoders.tomtom requires an API key
* geopy.geocoders.mapquest requires an API key
* geopy.geocoders.mapbox requires an API key
* geopy.geocoders.yahoo returns a KeyError: 'statusDescription'
* geopy.geocoders.ottawa does not return many responses
* geopy.geocoders.Nominatim (osm) gives a reasonable number of responses, will go with this

Another option is to go though website requests directly, e.g.

    https://geocoder.ca/T0H 2N2?json=1

This returns a JSON with latitude and longitude for a postal code.  However, this requires a full code and is limited to 500-2000 requests per day.  An exhaustive search of final three digits of postal code and averaging the locations of the returns would require ~2600 requests per entry, which is prohibitive.

from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

import config

geolocator = Nominatim(user_agent=config.NM_AGENT)
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1.5) # Hard minimum is 1.0 seconds

df_n['GString'] = None
df_n['Lat'] = None
df_n['Lon'] = None
df_n.head()

for i, row in df_n.iterrows():
    g = geocode(f"{row['Neighbourhood']}, Toronto, Ontario")
    if g!=None:
        df_n.loc[i, 'GString'] = g[0]
        df_n.loc[i, 'Lat'] = g[1][0]
        df_n.loc[i, 'Lon'] = g[1][1]
df_n.head()

Let's find the indices where the lookup failed:

ind_fail = df_n.index[~df_n['GString'].notnull()]
print(f'Lookup failed for {len(ind_fail)} neighbourhoods')

Construct a dataframe of these missing values for refilling:

df_nfail = df_n.loc[ind_fail,:]
showalldf(df_nfail)

Let's try filling in these blanks by adjusting the search string:

for i, row in df_nfail.iterrows():
    g = geocode(f"{row['Neighbourhood']}, {row['Borough']}, Ontario")
    if g!=None:
        df_nfail.loc[i, 'GString'] = g[0]
        df_nfail.loc[i, 'Lat'] = g[1][0]
        df_nfail.loc[i, 'Lon'] = g[1][1]
showalldf(df_nfail)

Not so easy, let's try adding the postal code:

geocode(f"Capitol Building, Toronto, Ontario")

geocode('Caledonia, Toronto, Ontario')

geocode('Fairbank, Toronto, Ontario')

geocode('Caledonia-Fairbank, Toronto, Ontario')

geocode('Del Rey, Toronto, Ontario')

geocode('Keelesdale, Toronto, Ontario')

geocode('Silverthorn, Toronto, Ontario')

geocode('Keelesdale and Silverthorn, Toronto, Ontario')

geocode('union station, Toronto, Ontario')

geocode('local airport, Toronto, Ontario')

geocode('humber bay, Toronto, Ontario')

geocode('beaumonde heights, Toronto, Ontario')

So some locations can be found; names are spelled slightly differently, and some are not correct.

Going through Google Maps, there is functionality for obtaining boundaries of postal codes, and I used this to check the above, but I don't particularly want to pay for that.



Let's also find where the returned location string does not match the neighborhood name exactly, and make a new dataframe for those entries as well.

showalldf(df_n)

ind_mismatch = []
for i, row in df_n.iterrows():
    if row['GString']!=None and row['GString'].split(',')[0]!=row['Neighbourhood']:
        ind_mismatch.append(i)
ind_mismatch = df_n.index[ind_mismatch]
print(f'Non-exact lookups for {len(ind_mismatch)} neighbourhoods')

df_nmismatch = df_n.loc[ind_mismatch,:]
df_nmismatch.head()



for i, row in df_n.iterrows():
    if row['GString']!=None:
        df_n.loc[i,'Found?'] = row['GString'].split(',')[0]==row['Neighbourhood']
df_n.head(100)

df_n.loc[df_n['Found?']==False,'Found?'].count()

for _, row in df_n.iterrows():
    print(f"{row['Neighbourhood']}, Toronto, Ontario")

ll = []
for b in df['Borough'].unique():
    ll.append(geocode(f'{b}, ON, Canada'))
ll

df['Borough'].unique()

llpc = [geocode(f'{pc}, ON, Canada') for pc in df['Postal Code']]
llpc

This didn't turn out so well.  Let's try with neighborhood names.

df['Neighbourhood'][2].split(',')

df.rows()

llnt = [geocode(f"{n}, {r['Borough']}, Ontario, Canada") for _, r in df.iterrows() for n in r['Neighbourhood'].split(',')]

llnt

llnt2 = [None if r['Borough']!='Downtown Toronto' else geocode(f"{n}, {r['Borough'].split()[-1]}, Ontario, Canada") for _, r in df.iterrows() for n in r['Neighbourhood'].split(',')]
llnt2

For below, 'Downtown Toronto' fails, maybe just put Toronto, or omit the Borough

# Combine llnt and llnt2
llnt3 = [tt if t==None else t for t, tt in zip(llnt, llnt2)]
llnt3

# Get the indices of the None entries for further inspection
llnt3list = [i for i, x in enumerate(llnt3) if x==None]
print(f'Number of missing entries: {len(llnt3list)}')
print(f'At indices: {llnt3list}')

Let's try again, as other modifiers of Toronto in the Borough name may be confusing

llnt4 = [None if r['Borough'].split()[-1]!='Toronto' else geocode(f"{n}, {r['Borough'].split()[-1]}, Ontario, Canada") for _, r in df.iterrows() for n in r['Neighbourhood'].split(',')]
llnt4

# Combine results
llnt5 = [tt if t==None else t for t, tt in zip(llnt3, llnt4)]
llnt5

# Get the indices of the None entries for further inspection
llnt5list = [i for i, x in enumerate(llnt5) if x==None]
print(f'Number of missing entries: {len(llnt5list)}')
print(f'At indices: {llnt5list}')

So two more entries were found, better than nothing.

Let's work at combining entries to make a lat/long for each postal code



lllist = [x[1] for x in llnt5 if x!=None]
a, b = zip(*lllist)
print(min(a), max(a), np.std(a), min(b), max(b), np.std(b))

[(f"{n}, {r['Borough']}, Ontario, Canada") for _, r in df.iterrows() for n in r['Neighbourhood'].split(',')]

lln = [geocode(f'{n}, ON, Canada') for n in [s.split(',', trim=True) for s in df['Neighbourhood']]]
lln

df['location'] = df['name'].apply(geocode)
df.head()

address = 'The Kingsway, Toronto, ON'
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
location

address = 'Old Mill North, Toronto, ON'
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
location

address = 'Montgomery Road, Toronto, ON'
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
location

import geocoder
# initialize your variable to None
lat_lng_coords = None

postal_code = 'M3A'

# loop until you get the coordinates
while(lat_lng_coords is None):
  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
  lat_lng_coords = g.latlng

Testing Foursquare Lookup

#df_in_test = df_final.iloc[0:2,:]
#df_venues_test = None
df_in_test = df_final.iloc[0:6,:]
isComplete_test, df_venues_test = resumableGetNearbyVenues(df_in_test, df_venues_test, radius=RADIUS, extend=True)
print('Venue lookup completed successfully' if isComplete_test else 'Venue lookup interrupted')

Testing scaling of venue vector matrix

vectors = k_means_data[-1]['features'].drop(['Dwelling Density','Log10 (Population Density)','Log10 (1 - Residentiality)','Log10 (Venue Density)'],axis=1).values
print('Before transformation:')
print(f"Venue Mean: {vectors.flatten().mean()}")
print(f"Venue STD: {vectors.flatten().std()}")
scaler = StandardScaler()
scaler.fit(vectors.reshape(-1,1))
#scaler.mean_ = 0.05
vectors = scaler.transform(vectors.reshape(-1,1)).reshape(*list(vectors.shape))
print('After transformation:')
print(f"Venue Mean: {vectors.flatten().mean()}")
print(f"Venue STD: {vectors.flatten().std()}")

K Means Metric Evaluation

I found yellowbrick and attempted to get some quantitative values for my optimal-k evaluation:

from sklearn.cluster import KMeans
import yellowbrick
import importlib
importlib.reload(yellowbrick)

from yellowbrick.cluster import KElbowVisualizer

# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(1,41))


visualizer.fit(k_means_data[0]['features'])        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure

But this resulted in an ImportError:

ImportError                               Traceback (most recent call last)
<ipython-input-77-f2b12d25b767> in <module>
      1 from sklearn.cluster import KMeans
----> 2 import yellowbrick
      3 import importlib
      4 importlib.reload(yellowbrick)
      5 

~\anaconda3\envs\Coursera\lib\site-packages\yellowbrick\__init__.py in <module>
     37 from .anscombe import anscombe
     38 from .datasaurus import datasaurus
---> 39 from .classifier import ROCAUC, ClassBalance, ClassificationScoreVisualizer
     40 
     41 # from .classifier import crplot, rocplot

~\anaconda3\envs\Coursera\lib\site-packages\yellowbrick\classifier\__init__.py in <module>
     28 from .confusion_matrix import ConfusionMatrix, confusion_matrix
     29 from .rocauc import ROCAUC, roc_auc
---> 30 from .threshold import DiscriminationThreshold, discrimination_threshold
     31 from .prcurve import PrecisionRecallCurve, PRCurve, precision_recall_curve
     32 

~\anaconda3\envs\Coursera\lib\site-packages\yellowbrick\classifier\threshold.py in <module>
     28 from sklearn.model_selection import ShuffleSplit
     29 from sklearn.metrics import precision_recall_curve
---> 30 from sklearn.utils import indexable, safe_indexing
     31 from sklearn.utils.multiclass import type_of_target
     32 

ImportError: cannot import name 'safe_indexing' from 'sklearn.utils'

This could not be solved by rolling back to an old version of scikit-learn (0.21.3), even though it should have been!

Initial k-means evaluation without matrix of inputs

from sklearn.cluster import KMeans
df_features = toronto_grouped_0.drop('Postal Code', 1)
#df_features = df_final[['Density','...']].concat(toronto_grouped.drop('Postal Code', 1))
# set number of clusters
ks = range(1,21)
mss = [] # Mean sum of squares of intracluster distances to cluster centroid
for i, k in enumerate(ks):
    kmeans = KMeans(n_clusters=k, random_state=0, n_init=12).fit(df_features)
    mss.append(kmeans.inertia_)
    # print(f"{k} clusters mss = {mss[i]}")
plt.plot(ks,mss)
plt.xticks(ks)
plt.xlabel('Number of Clusters')
plt.ylabel('Sum of Squared Distance to Nearest Cluster Centroid')

Beginning to write an ANOVA evaluation of the clustered data

df_anova = pd.DataFrame(index=np.unique(clusters),columns=df_features.columns)

df_cluster_avg = pd.DataFrame(index=np.unique(clusters),columns=df_features.columns.drop(['Postal Code','Cluster'],1))
for ind in df_cluster_avg.index:
    df_cluster_avg.loc[ind,:] = features[clusters==ind].mean()
print('Cluster Mean:')
display(df_cluster_avg)

df_cluster_std = pd.DataFrame(index=np.unique(clusters),columns=df_features.columns.drop(['Postal Code','Cluster'],1))
for ind in df_cluster_std.index:
    df_cluster_std.loc[ind,:] = features[clusters==ind].std()
print('Cluster STD:')
display(df_cluster_std)

df_global_avg = pd.DataFrame(index=[0],columns=df_features.columns.drop(['Postal Code','Cluster'],1))
df_global_avg.index = [0]
df_global_avg.loc[0,:] = features.mean()
print('Global Mean:')
display(df_global_avg)

df_global_std = pd.DataFrame(index=[0],columns=df_features.columns.drop(['Postal Code','Cluster'],1))
df_global_std.index = [0]
df_global_std.loc[0,:] = features.std()
print('Global STD:')
display(df_global_std)

Copyright © 2020 Daniel Nezich. This notebook and its source code are released under the terms of the MIT License. Code repository: GitHub